Prospecção de Dados 2021/2022

Second Home Assignment - Best Features and Model Selection

Group 4

Students

0.1 - Import Libraries

0.2 - Import Dataset

0.3 - Exploratory Data Analysis

0.3.1 - Profile Report
0.3.2 - Inspecting Data types and missing values
0.3.3 - Checking and dropping duplicates

Summary of the dataset after deleting duplicates

0.3.4 - Inferring Outliers

0.4 - Data Partition

0.5 - Data Standardization

0.6 - Inferring the variance of attributes

0.7 - Defining Functions

1 - Selecting the Best Features and Dimensionality Reduction

In this step, we will find the best features with the Select From Model and Sequential Feature Selector algorithms, supported by a Pearson correlation analysis as a first step. Then, we will perform a PCA in order to reduce the dimensionality of the dataset.

1.1 - Pearson Correlation

High degree: if the coefficient value lies between ±0.50 and ±1, it is said to be a strong correlation.
Moderate degree: if the value lies between ±0.30 and ±0.49, it is said to be a medium correlation.
Low degree: when the value lies below ±0.29, it is said to be a small correlation.

So, in the step below, we will select the features that are tightly correlated (correlation >= 0.50) with the target 'critical_temp'.
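A minimal sketch of this selection step. The small synthetic frame and the column names `strong_feat` and `weak_feat` are stand-ins for the real dataset; only 'critical_temp' matches the assignment's target.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the dataset; 'critical_temp' is the target.
rng = np.random.default_rng(0)
n = 200
t = rng.normal(size=n)
df = pd.DataFrame({
    "strong_feat": t + 0.3 * rng.normal(size=n),  # strongly correlated with the target
    "weak_feat": rng.normal(size=n),              # pure noise
    "critical_temp": t,
})

# Absolute Pearson correlation of every feature with the target.
corr = df.corr(method="pearson")["critical_temp"].abs().drop("critical_temp")

# Keep only the features in the "high degree" band (|r| >= 0.50).
selected = corr[corr >= 0.50].index.tolist()
print(selected)
```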

1.2 - Select From Model

We will select the best features with the Select From Model algorithm using Random Forest Regressor, Decision Tree Regressor and Linear Regression as estimators. Additionally, we evaluate the explained variance with RF, DT and LR at each step.
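A minimal sketch of Select From Model with one of the three estimators (Random Forest Regressor); the synthetic data and the hyperparameters shown are illustrative stand-ins, not the assignment's actual settings.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.feature_selection import SelectFromModel

# Synthetic regression data standing in for the scaled training set.
X, y = make_regression(n_samples=200, n_features=10, n_informative=3,
                       random_state=42)

# SelectFromModel keeps features whose importance exceeds the threshold
# (by default, the mean of the fitted estimator's feature importances).
selector = SelectFromModel(RandomForestRegressor(n_estimators=50,
                                                 random_state=42))
selector.fit(X, y)
X_selected = selector.transform(X)
print(X_selected.shape)
```

The same pattern applies to Decision Tree Regressor and, via coefficients instead of importances, to Linear Regression.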

1.2.1 - Random Forest Regressor

1.2.2 - Decision Tree Regressor

1.2.3 - Linear Regression

1.3 - Sequential Feature Selector

We will select the best features with the Sequential Feature Selector algorithm using Random Forest Regressor, Decision Tree Regressor and Linear Regression as estimators. Additionally, we evaluate the explained variance with RF, DT and LR at each step.
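A minimal sketch of the Sequential Feature Selector, shown here with Linear Regression on synthetic stand-in data; the number of features to select (3) is illustrative.

```python
from sklearn.datasets import make_regression
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LinearRegression

# Synthetic data standing in for the scaled training set.
X, y = make_regression(n_samples=100, n_features=8, n_informative=3,
                       random_state=0)

# Forward selection: greedily add the feature that most improves the CV score.
sfs = SequentialFeatureSelector(LinearRegression(), n_features_to_select=3,
                                direction="forward", cv=5)
sfs.fit(X, y)
print(sfs.get_support())  # boolean mask over the original feature columns
```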

1.4 - Best Features Selected

2 - Principal Components Analysis

In this step, we start by evaluating the total explained variance with 10 principal components:

In order to decide how many principal components to reduce our data to, we produced a Scree plot and an Explained Variance by Components plot:

Analysing the Scree plot, the slope is steeper until the second component, which suggests choosing two principal components.

However, the principal components must explain a cumulative percentage above 70%. From the Explained Variance by Components plot, we can infer that 4 components explain 72.6% of the features' variance, whereas only 58.8% was explained with 2 principal components.


Given these results, we select 4 principal components.
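The cumulative-variance check described above can be sketched as follows; the synthetic low-rank data is a stand-in, so the component count it yields need not match the 4 found for the real dataset.

```python
import numpy as np
from sklearn.decomposition import PCA

# Synthetic data with a few strong directions plus noise.
rng = np.random.default_rng(0)
latent = rng.normal(size=(200, 4))
X = latent @ rng.normal(size=(4, 12)) + 0.5 * rng.normal(size=(200, 12))

pca = PCA(n_components=10).fit(X)
cumulative = np.cumsum(pca.explained_variance_ratio_)

# Smallest number of components explaining at least 70% of the variance.
k = int(np.argmax(cumulative >= 0.70)) + 1
print(cumulative.round(3), k)
```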

Next, we train the PCA model with 4 principal components and apply the transformation to x_train_scaled and x_test_scaled:
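A minimal sketch of this fit/transform pattern, with synthetic arrays standing in for the scaled train and test splits. Note that both the scaler and the PCA are fitted on the training split only, then applied to both splits.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the dataset's feature matrix.
rng = np.random.default_rng(1)
X = rng.normal(size=(300, 12))
x_train, x_test = train_test_split(X, test_size=0.3, random_state=1)

# Standardize using statistics learned from the training split only.
scaler = StandardScaler().fit(x_train)
x_train_scaled = scaler.transform(x_train)
x_test_scaled = scaler.transform(x_test)

# Fit PCA with 4 components on the training split, transform both splits.
pca = PCA(n_components=4).fit(x_train_scaled)
x_train_pca = pca.transform(x_train_scaled)
x_test_pca = pca.transform(x_test_scaled)
print(x_train_pca.shape, x_test_pca.shape)
```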

3 - Finding the best model

After selecting the best features and reducing the dimensionality of the train and test sets, we will find the best 10 models using the GridSearchCV algorithm (5-fold CV) with the feature-selection set and the principal-components set individually, in order to conclude which method obtains the best results. The grid search will be performed with the following estimators:

For the top model obtained with the set of best features and with the set of principal components, we will then train the model.

We start by performing a first run that saves the results to a CSV file; we then read the CSV back into a variable. All the results are merged into a final results dataframe and sorted by mean_test_score (R², the coefficient of determination).
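A minimal sketch of this grid-search-and-rank step, using a single Decision Tree estimator and a tiny illustrative parameter grid as stand-ins for the full set of estimators and grids:

```python
import pandas as pd
from sklearn.datasets import make_regression
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in for one of the prepared feature sets.
X, y = make_regression(n_samples=150, n_features=6, random_state=0)

# 5-fold grid search scored by R^2 (the coefficient of determination).
grid = GridSearchCV(DecisionTreeRegressor(random_state=0),
                    param_grid={"max_depth": [2, 4, 6]},
                    cv=5, scoring="r2")
grid.fit(X, y)

# cv_results_ can be written to CSV, merged with other runs, and
# sorted by mean_test_score to rank the candidate models.
results = pd.DataFrame(grid.cv_results_).sort_values("mean_test_score",
                                                     ascending=False)
print(results[["params", "mean_test_score"]].head(1))
```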

3.1 - Feature Selection Method set

3.1.1 - Format the Final Results dataset

3.2 - Principal Components set

3.2.1 - Format the Final Results dataset

4 - Training the Best Model

Now that we know the best models for the feature-selection and principal-components sets, we will proceed to test them.
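A minimal sketch of this final step, with a Random Forest Regressor and synthetic data standing in for the actual winning model and feature set: refit on the full training split and report R² on the held-out test split.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the selected feature set.
X, y = make_regression(n_samples=200, n_features=8, noise=5.0, random_state=0)
x_train, x_test, y_train, y_test = train_test_split(X, y, test_size=0.3,
                                                    random_state=0)

# Refit the winning estimator on the training split, score on the test split.
best = RandomForestRegressor(n_estimators=100, random_state=0)
best.fit(x_train, y_train)
score = r2_score(y_test, best.predict(x_test))
print(score)
```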

4.1 - Feature Selection Method set

Best Model for feature selection set:

4.2 - Principal Components set

Best Model for principal components set: